Introduction

Optimal Transport (1D)

Consider two PDFs \(f\) and \(g\) with CDFs denoted \(F\) (with \(dF(x)=f(x)\ dx\)) and \(G\) (with \(dG(x)=g(x)\ dx\)) respectively, for random variables \(x\) and \(y\).

Problem: How to find a (bijective and differentiable) transformation \(\phi\) such that \(y=\phi(x)\) for two random variables \(x\sim f(x)\) and \(y \sim g(y)\)?

\[ \text{Solution (1D):} \quad \phi(x)=G^{-1}\circ F(x) \]

This solution has also been shown to minimize the quadratic Wasserstein transport cost:

\[ \mathcal{W} = \mathbb{E}_{(x,y)\sim\pi} \left[ |x-y|^2 \right] \quad \text{s.t.}\quad f=\int \pi\ dy \quad \text{and}\quad g=\int \pi\ dx \]
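The 1D map \(\phi=G^{-1}\circ F\) can be approximated empirically from samples by combining the empirical CDF with the empirical quantile function. A minimal sketch (the Gaussian source/target pair is an arbitrary illustration):

```python
import numpy as np

def ot_map_1d(x_samples, y_samples):
    """Empirical 1D optimal transport map phi(x) = G^{-1}(F(x)).

    F is the empirical CDF of x_samples; G^{-1} is the empirical
    quantile function of y_samples.
    """
    x_sorted = np.sort(x_samples)
    y_sorted = np.sort(y_samples)

    def phi(x):
        # F(x): empirical CDF of the source evaluated at x
        u = np.searchsorted(x_sorted, x, side="right") / len(x_sorted)
        # G^{-1}(u): empirical quantile function of the target
        return np.quantile(y_sorted, np.clip(u, 0.0, 1.0))

    return phi

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 10_000)   # x ~ N(0, 1)
y = rng.normal(5.0, 2.0, 10_000)   # y ~ N(5, 2^2)
phi = ot_map_1d(x, y)
z = phi(x)                         # mapped samples approximately follow g
```

The mapped samples \(z=\phi(x)\) then approximately follow the target distribution \(g\).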

Applications (1D)

Example (equalization): \(y=\phi(x)=Id\circ F(x)=F(x)\), using \(g\equiv \mathcal{U}\) (uniform) so that \(G=G^{-1}=Id\) (the identity function).

Example (bridge/synthesis): \(y=\phi(x)=G^{-1}\circ Id(x)=G^{-1}(x)\) synthesizes samples \(y\sim g\) from samples \(x\sim \mathcal{U}\), which are easily generated.
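Both applications can be illustrated with the empirical CDF and an inverse CDF; the exponential target below is an arbitrary example chosen because its \(G^{-1}\) has a closed form:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 10_000)

# Equalization: y = F(x), with g uniform so G = G^{-1} = Id.
# The empirical CDF maps the samples to an approximately uniform law.
x_sorted = np.sort(x)
u = np.searchsorted(x_sorted, x, side="right") / len(x)   # F(x) ~ U(0, 1)

# Bridge/synthesis: y = G^{-1}(u) generates samples y ~ g from u ~ U(0, 1).
# Example target g = Exponential(1), whose inverse CDF is G^{-1}(u) = -log(1 - u).
u2 = rng.uniform(0.0, 1.0, 10_000)
y = -np.log1p(-u2)   # inverse-CDF (inverse-transform) sampling
```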

Iterative Distribution Transfer

The solution \(\phi(x)=G^{-1}\circ F(x)\) works for r.v. \(x,y\in \mathbb{R}\) but what about if \(\mathbf{x},\mathbf{y}\in\mathbb{R}^d\) ?

  • The Iterative Distribution Transfer (IDT) algorithm proposed by Pitié et al. applies the 1D solution iteratively, by projecting the multidimensional data \(\mathbf{x}\) and \(\mathbf{y}\) onto 1D subspaces.

  • The projection directions are chosen orthogonal and spanning \(\mathbb{R}^d\), potentially with redundancy. To vary these directions from one iteration to the next, a random rotation is applied.

  • IDT has been applied to colour pixels (\(d=3\)).
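A minimal sketch of IDT under simplifying assumptions (equal sample sizes, so the empirical 1D map reduces to rank matching; `random_rotation` is an illustrative helper, not the authors' code):

```python
import numpy as np

def random_rotation(d, rng):
    # QR decomposition of a Gaussian matrix yields a random orthogonal matrix
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    return q * np.sign(np.diag(r))

def idt(x, y, n_iter=30, seed=0):
    """Sketch of Iterative Distribution Transfer: rotate both point
    clouds at random, then apply the 1D map G^{-1} o F along each
    rotated coordinate axis."""
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    x = x.copy()
    for _ in range(n_iter):
        R = random_rotation(d, rng)
        xr, yr = x @ R.T, y @ R.T          # project onto rotated axes
        for j in range(d):
            order = np.argsort(xr[:, j])
            # rank matching = empirical G^{-1} o F when sample sizes agree
            xr[order, j] = np.sort(yr[:, j])
        x = xr @ R                          # rotate back
    return x

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, size=(2000, 3))   # source cloud, e.g. colour pixels (d=3)
y = rng.normal(3.0, 0.5, size=(2000, 3))   # target cloud
z = idt(x, y)                              # z approximately distributed like y
```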

[Figure: colour transfer example with input \(x\in\mathbb{R}^3\), target \(y\in\mathbb{R}^3\), and result \(\phi_{IDT}(x)\in\mathbb{R}^3\)]

Iterative Distribution Transfer


Using the taxonomy of Zhang et al. (2021):

Optimal Transport Methods
│
├───Regularization based
│   └─── ...
│
└───Projection based
    ├───Random Projection
    │   └───Iterative Distribution Transfer (IDT)
    │
    └───Projection Pursuit
A Review on Modern Computational Optimal Transport Methods with Applications in Biomedical Research, J. Zhang et al. 2021 [arXiv:2008.02995]

Transfer function \(\phi_{OT}\)

The solution \(\phi_{OT}(x)=G^{-1}\circ F(x)\) may not capture well the actual process (i.e. the joint distribution \(\pi(x,y)\)) that generated the data \(\lbrace (x^{(i)},y^{(i)})\rbrace_{i=1,\cdots,n}\).

The solution \(\phi_{OT}\) (green curve) only sees the datasets \(\lbrace x^{(i)}\rbrace\) and \(\lbrace y^{(i)}\rbrace\) (blue points), without any knowledge of the correspondences \(\lbrace (x^{(i)},y^{(i)})\rbrace\) (black points). The solution \(\phi_{OT}\) is bijective by definition (and hence monotone).
Patch-Based Colour Transfer with Optimal Transport. Alghamdi, Grogan & Dahyot [DOI:10.23919/EUSIPCO.2019.8902611]

Transfer function \(\phi_{NW}\)

\[ \hat{y}=\phi_{NW}(x)=\mathbb{E}[y|x] =\int y \ p_{y|x}(y|x) dy \simeq \frac{\sum_{i=1}^n y^{(i)} \ k_h(x-x^{(i)})}{\sum_{i=1}^n k_h(x-x^{(i)})} \]

The solution \(\phi_{NW}\) (blue curve) takes advantage of the correspondences \(\lbrace (x^{(i)},y^{(i)})\rbrace\) (black points) when they are available. \(\phi_{NW}\) is not monotone, and its smoothness is controlled by the bandwidth \(h\) of the kernel.
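The Nadaraya-Watson estimator above can be sketched with a Gaussian kernel \(k_h\); the sine-curve data below is synthetic, for illustration only:

```python
import numpy as np

def nadaraya_watson(x_train, y_train, h):
    """Nadaraya-Watson estimate of E[y|x] with a Gaussian kernel k_h."""
    def phi_nw(x):
        x = np.atleast_1d(x)
        # kernel weights k_h(x - x_i) for every query/training pair
        w = np.exp(-0.5 * ((x[:, None] - x_train[None, :]) / h) ** 2)
        # weighted average of the y^(i): sum_i y_i k_h(x - x_i) / sum_i k_h(x - x_i)
        return (w @ y_train) / w.sum(axis=1)
    return phi_nw

rng = np.random.default_rng(2)
x_train = rng.uniform(-3.0, 3.0, 500)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=500)   # noisy correspondences
phi_nw = nadaraya_watson(x_train, y_train, h=0.3)
```

A smaller \(h\) tracks the data more closely (less smooth); a larger \(h\) averages over more neighbours (smoother).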

Transfer function \(\phi_{L2}\)

We have recently proposed to fit a spline function, with correspondences (\(\phi_{L2}^{Corr}\)) and without (\(\phi_{L2}\)), that minimizes the robust \(L2\) distance between GMMs \(f\) and \(g\).
Sliced L2 Distance for Colour Grading, Alghamdi & Dahyot [arXiv:2102.09297, EUSIPCO 2021, to appear]

Deep Neural Networks

In DNNs, the machine model corresponds to a function \(\mathcal{M}_{\theta}\) designed as a composition of linear and non-linear functions:

\[ \mathcal{M}_{\theta}(x)= \mathrm{W}_L \circ \cdots \circ \underbrace{\phi_l \circ \mathrm{W}_l}_{\text{Layer } l} \circ \cdots \circ \phi_1 \circ \mathrm{W}_1 (x) \] Layer \(l\): a linear transformation followed by a pointwise non-linear activation: \[ \phi_l(x)=\left(\phi_{l,1}(x_1),\cdots,\phi_{l,N_l}(x_{N_l}) \right) \] where the scalar map \(\phi_{l,n}:\mathbb{R} \rightarrow \mathbb{R}\) is the activation or transfer function of the neuron indexed by \((l, n)\).
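A minimal sketch of this layered composition in NumPy; the layer sizes and the `tanh` activation are arbitrary choices for illustration:

```python
import numpy as np

def mlp(x, weights, activations):
    """M_theta(x) = W_L o phi_{L-1} o W_{L-1} o ... o phi_1 o W_1 (x)."""
    for W, phi in zip(weights[:-1], activations):
        x = phi(W @ x)        # layer l: linear map, then pointwise activation
    return weights[-1] @ x    # final linear layer W_L

rng = np.random.default_rng(3)
weights = [rng.normal(size=(4, 2)),   # W_1: R^2 -> R^4
           rng.normal(size=(1, 4))]   # W_2: R^4 -> R^1
activations = [np.tanh]               # phi_1 applied element-wise
out = mlp(np.array([0.5, -1.0]), weights, activations)   # shape (1,)
```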

Transfer (Activation) Functions

In a conventional setup, the activation function is fixed: \[ \phi_{l,n}(x)=\phi(x-b_{l,n}) \] e.g. with \[ \begin{array}{lc} \text{softplus: }&\phi(x)=\log\left(1+\exp(x)\right)\\ \text{ReLU: }&\phi(x)=\max(0,x)=x^+\\ \end{array} \]
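The two fixed activations above, together with the per-neuron shift \(\phi(x-b_{l,n})\), can be sketched as follows (the `shifted` helper is illustrative, not a standard API):

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)), written with logaddexp for overflow safety
    return np.logaddexp(0.0, x)

def relu(x):
    # max(0, x) = x^+
    return np.maximum(0.0, x)

def shifted(phi, b):
    """Fixed activation with a per-neuron bias: phi_{l,n}(x) = phi(x - b)."""
    return lambda x: phi(x - b)
```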

Learnt activation functions

Future works

  • Is learning the directions of projection needed? (cf. random projections in IDT, and the wavelets chosen in scattering networks)

  • Is end-to-end optimization needed for learning activation functions?

  • Can (cost effective and accurate) DNNs be designed using pre-set projections + learnt activation functions?